Modeling Annotator Accuracies for Supervised Learning
Abstract
Crowdsourcing [5] methods are quickly changing the landscape for the quantity, quality, and type of labeled data available to supervised learning. While such data can now be obtained more quickly and cheaply than ever before, the resulting labels also tend to be far noisier, due to the limitations of current quality control mechanisms and processes. Given such noisy labels and a supervised learner, an important question is how labeling effort can best be allocated to maximize learner accuracy. For example, should we (a) label additional unlabeled examples, or (b) acquire additional labels for already-labeled examples in order to reduce label noise [12]? In comparison to prior work, we show that faster learning can be achieved in case (b) by incorporating knowledge of worker accuracies into consensus labeling [13]. Evaluation on four binary classification tasks with simulated annotators demonstrates the empirical importance of modeling annotator accuracies.
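The accuracy-weighted consensus idea in case (b) can be sketched as a simple Bayesian combination of binary votes. This is an illustrative sketch, not the paper's exact estimator: the function name is hypothetical, and it assumes each annotator errs independently with a known, symmetric accuracy.

```python
import math

def weighted_consensus(labels, accuracies, prior=0.5):
    """Posterior probability that the true binary label is 1, given each
    annotator's vote (0 or 1) and an estimate of that annotator's accuracy.

    Assumes annotators err independently and symmetrically (same accuracy
    on both classes); an unweighted majority vote is the special case where
    every accuracy is equal.
    """
    log_odds = math.log(prior / (1 - prior))
    for y, p in zip(labels, accuracies):
        # Each annotator shifts the log-odds by log(p / (1 - p))
        # toward whichever label they voted for.
        lr = math.log(p / (1 - p))
        log_odds += lr if y == 1 else -lr
    return 1 / (1 + math.exp(-log_odds))

# One highly accurate annotator voting 1 outweighs
# two weak annotators voting 0.
posterior = weighted_consensus([1, 0, 0], [0.95, 0.6, 0.6])
```

Here the accuracy-weighted posterior sides with the single reliable annotator even though the raw majority votes the other way, which is exactly the situation where an unweighted consensus wastes the extra labels.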
Similar Resources
Modeling Multiple Annotator Expertise in the Semi-Supervised Learning Scenario
Learning algorithms normally assume that there is at most one annotation or label per data point. However, in some scenarios, such as medical diagnosis and on-line collaboration, multiple annotations may be available. In either case, obtaining labels for data points can be expensive and time-consuming (in some circumstances groundtruth may not exist). Semi-supervised learning approaches have sh...
Modeling annotator expertise: Learning when everybody knows a bit of something
Supervised learning from multiple labeling sources is an increasingly important problem in machine learning and data mining. This paper develops a probabilistic approach to this problem when annotators may be unreliable (labels are noisy), but also their expertise varies depending on the data they observe (annotators may have knowledge about different parts of the input space). That is, an anno...
Modeling Annotator Rationales with Application to Pneumonia Classification
We present a technique to leverage annotator rationale annotations for ventilator assisted pneumonia (VAP) classification. Given an annotated training corpus of 1344 narrative chest X-ray reports, we report results for two supervised classification tasks: Critical Pulmonary Infection Score (CPIS) and the likelihood of Pneumonia (PNA). For both tasks, our training data contain annotator rational...
Active Learning from Multiple Knowledge Sources
Some supervised learning tasks do not fit the usual single-annotator scenario. In these problems, ground truth may not exist and multiple annotators are generally available. A few approaches have been proposed to address this learning problem. In this setting, active learning (AL), the problem of optimally selecting unlabeled samples for labeling, offers new challenges and has received little at...
Active Learning from Crowds
Obtaining labels can be expensive or time-consuming, but unlabeled data is often abundant and easier to obtain. Most learning tasks can be made more efficient, in terms of labeling cost, by intelligently choosing specific unlabeled instances to be labeled by an oracle. The general problem of optimally choosing these instances is known as active learning. As it is usually set in the context of su...